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Abstract — Classical rate-distortion theory requires 
knowledge of an elusive source distribution. Instead, we 
analyze rate-distortion properties of individual objects 
using the recently developed algorithmic rate-distortion 
theory. The latter is based on the noncomputable notion 
of Kolmogorov complexity. To apply the theory we 
approximate the Kolmogorov complexity by standard 
data compression techniques, and perform a number 
of experiments with lossy compression and denoising 
of objects from different domains. We also introduce 
a natural generalization to lossy compression with side 
information. To maintain full generality we need to 
address a difficult searching problem. While our solutions 
are therefore not time efficient, we do observe good 
denoising and compression performance. 

Index Terms — compression, denoising, rate-distortion, 
structure function, Kolmogorov complexity 

I. Introduction 

Rate-distortion theory analyzes communication over a 
channel under a constraint on the number of transmitted 
bits, the "rate". It currently serves as the theoretical 
underpinning for many important applications such as 
lossy compression and denoising, or more generally, 
applications that require a separation of structure and 
noise in the input data. 

Classical rate-distortion theory evolved from Shan- 
non's theory of communication[l]. It studies the trade- 
off between the rate and the achievable fidelity of the 
transmitted representation under some distortion func- 
tion, where the analysis is carried out in expectation 
under some source distribution. Therefore the theory can 
only be meaningfully applied if we have some reasonable 
idea as to the distribution on objects that we want to 
compress lossily. While lossy compression is ubiquitous, 
propositions with regard to the underlying distribution 
tend to be ad-hoc, and necessarily so, because (1) it 
is a questionable assumption that the objects that we 
submit to lossy compression are all drawn from the same 
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probability distribution, or indeed that they are drawn 
from a distribution at all, and (2) even if a true source 
distribution is known to exist, in most applications the 
sample space is so large that it is extremely hard to 
determine what it is like: objects that occur in practice 
very often exhibit more structure than predicted by the 
used source model. 

For large outcome spaces then, it becomes important 
to consider structural properties of individual objects. 
For example, if the rate is low, then we may still be 
able to transmit objects that have a very regular structure 
without introducing any distortion, but this becomes 
impossible for objects with high information density. 
This point of view underlies some recent research in 
the lossy compression community [2]. At about the same 
time, a rate-distortion theory which allows analysis of 
individual objects has been developed within the frame- 
work of Kolmogorov complexity [3]. It defines a rate- 
distortion function not with respect to some elusive 
source distribution, but with respect to an individual 
source word. Every source word thus obtains its own 
associated rate-distortion function. 

We will first give a brief intoduction to algorithmic 
rate-distortion theory in Section HI] We also describe a 
novel generalization of the theory to settings with side 
information, and we describe two distinct applications 
of the theory, namely lossy compression and denoising. 

Algorithmic rate-distortion theory is based on Kol- 
mogorov complexity, which is not computable. We nev- 
ertheless cross the bridge between theory and practice 
in Section|III| by approximating Kolmogorov complexity 
by the compressed size of the object by a general purpose 
data compression algorithm. Even so, approximating the 
rate-distortion function is a difficult search problem. We 
motivate and outline the genetic algorithm we used to 
approximate the rate-distortion function. 

In Section IIVI we describe four experiments in lossy 
compression and denoising. The results are presented 
and discussed in Section |V] Then, in Section IVII we 
take a step back and discuss to what extent our practical 



approach yields a faithful approximation of the theoret- 
ical algorithmic rate-distortion function. We end with a 
conclusion in Section IVIll 

II. Algorithmic Rate-Distortion 

Suppose we want to communicate objects x from a 
set of source words X using at most r bits per object. 
We call r the rate. We locate a good representation of x 
within a finite set y, which may be different from X in 
general (but we usually have ^ = 3^ in this text). The 
lack of fidelity of a representation y is quantified by a 
distortion function d : X x y ^ R. 

The Kolmogorov complexity of y, denoted K{y), is 
the length of the shortest program that constructs y. More 
precisely, it is the length of the shortest input to a fixed 
universal binary prefix machine that will output y and 
then halt; also see the textbook[3]. We can transmit any 
representation y that has K{y) < r, the receiver can 
then run the program to obtain y and is thus able to 
reconstruct x up to distortion d{x, y). Define the rate- 
distortion profile Pj. of the source word x as the set of 
pairs (r, a) such that there is a representation y G y with 
<i{x,y) < a and K{y) < r. The possible combinations 
of r and a can also be characterised by the rate -distortion 
function of the source word x, which is defined as 
Txio) = min{r : (r, a) G Px}, or by the distortion- 
rate function of the source word x, which is defined as 
dxir) = min{a : (r, a) G Px}- These two functions are 
somewhat like inverses of each other; although strictly 
speaking they are not since they are monotonic but not 
strictly monotonic. A representation y is said to witness 
the rate-distortion function of x if rx{d{x,y)) = K{y). 
These definitions are illustrated in Figure [0 

Algorithmic rate-distortion theory is developed and 
treated in much more detail in [4]. It is a generalization 
of Kolmogorov's structure function theory, see [5]. 

A. Side Information 

We generalize the algorithmic rate-distortion frame- 
work, so that it can accommodate side information. 
Suppose that we want to transmit a source word x G 
X and we have chosen a representation ?/ G 3^ as 
before. The encoder and decoder often share a lot of 
information: both might know that grass is green and the 
sky is blue, they might share a common language, and so 
on. They would not need to transmit such information. If 
encoder and decoder share some information z, then the 
programs they transmit to compute the representation y 
may use this side information z. Such programs might 
be shorter than they could be otherwise. This can be 
formalised by switching to the conditional Kolmogorov 




Representations that witness the rate-distortion function 



Fig. 1. Rate-distortion profile and distortion-rate function 

complexity K{y\z), which is the length of the shortest 
Turing machine program that constructs y on input z. We 
redefine K{y) — K{y\e), where e is the empty sequence, 
so that K{y\z) < K{y)-^0{l):tht length of the shortest 
program for y can never significantly increase when side 
information is provided, but it might certainly decrease 
when y and z share a lot of information[3]. We change 
the definitions as follows: The rate -distortion profile of 
the source word x with side information z is the set of 
pairs (r, a) such that there is a representation y G 3^ 
with d{x, y) < a and K{y\z) < r. The definitions of the 
rate-distortion function and the distortion-rate function 
are similarly changed. Henceforth we will omit mention 
of the side information z unless it is relevant to the 
discussion. 

While this generaUzation seems very natural the au- 
thors are not aware of earlier proposals along these 
lines. In Section|y]we will demonstrate one use for this 
generalized rate-distortion theory: removal of spelling 
errors in written text, an example where denoising is 
not practical without use of side information. 

B. Distortion Spheres and the Minimal Sufficient Statis- 
tic 

A representation y that witnesses the rate-distortion 
function is the best possible rendering of the source 
object X at the given rate because it minimizes the 
distortion, but if the rate is lower than K{x), then some 
information is necessarily lost. Since one of our goals is 
to find the best possible separation between structure and 
noise in the data, it is important to determine to what 
extent the discarded information is noise. 
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Given a representation y and the distortion a = 
d{x, y), we can find the source object x somewhere on 
the Ust of all x' & X that satisfy d{x',y) = a. The 
information conveyed about x by y and a is precisely, 
that X can be found on this list. We call such a list a 
distortion sphere. A distortion sphere of radius a, centred 
around y is defined as follows: 

Sy{a) := {x' gX: d{x', y) = a). (1) 

If the discarded information is pure white noise, then 
this means that x must be a completely random element 
of this list. Conversely, all random elements in the list 
share all "simply described" (in the sense of having 
low Kolmogorov complexity) properties that x satisfies. 
Hence, with respect to the "simply described" properties, 
every such random element is as good as x, see [4] for 
more details. In such cases a literal specification of the 
index of x (or any other random element) in the list is 
the most efficient code for x, given only that it is in 
Sy{a). A fixed-length, literal code requires log|5j,(a)| 
bits. (Here and in the following, all logarithms are 
taken to base 2 unless otherwise indicated.) On the 
other hand, if the discarded information is structured, 
then the Kolmogorov complexity of the index of x in 
Sy{a) will be significantly lower than the logarithm of 
the size of the sphere. The difference between these 
two codelengths can be used as an indicator of the 
amount of structural information that is discarded by the 
representation y. Vereshchagin and Vitanyi[4] call this 
quantity the randomness deficiency of the source object 
X in the set Sy{a), and they show that if y witnesses the 
rate-distortion function of x, then it minimizes the ran- 
domness deficiency at rate K{y); thus the rate-distortion 
function identifies those representations that account for 
as much structure as possible at the given rate. 

To assess how much structure is being discarded at 
a given rate, consider a code for the source object x 
in which we first transmit the shortest possible program 
that constructs both a representation y and the distortion 
d{x,y), followed by a literal, fixed-length index of x 
in the distortion sphere Sy{a). Such a code has the 
following length function: 

Ly{x) := K{{y,d{x,y))) + log\Sy{dix,y))\. (2) 

If the rate is very low then the representation y models 
only very basic structure and the randomness deficiency 
in the distortion sphere around y is high. Borrowing 
terminology from statistics, we may say that y is a 
representation that "underfits" the data. In such cases we 
should find that Ly{x) > K{x), because the fixed-length 
code for the index of x within the distortion sphere is 



suboptimal in this case. But suppose that y is complex 
enough that it satisfies Ly{x) « K{x). In [4], such rep- 
resentations are called (algorithmic) sufficient statistics 
for the data x. A sufficient statistic has close to zero 
randomness deficiency, which means that it represents 
all structure that can be detected in the data. However, 
sufficient statistics might contain not only structure, but 
noise as well. Such a representation would be overly 
complex, an example of overfitting. A minimal sufficient 
statistic balances between underfitting and overfitting. It 
is defined as the lowest complexity sufficient statistic, 
which is the same as the lowest complexity represen- 
tation y that minimizes the codelength Ly{x). As such 
it can also be regarded as the "model" that should be 
selected on the basis of the Minimum Description Length 
(MDL) principle[6]. To be able to relate the distortion- 
rate function to this codelength we define the codelength 
function \x{r) = Ly{x) where y is the representation 
that minimizes the distortion at rate r} 

C. Applications: Denoising and Lossy Compression 

Representations that witness the rate-distortion func- 
tion provide optimal separation between structure that 
can be expressed at the given rate and residual in- 
formation that is perceived as noise Therefore, these 
representations can be interpreted as denoised versions of 
the original. In denoising, the goal is of course to discard 
as much noise as possible, without losing any structure. 
Therefore the minimal sufficient statistic, which was 
described in the previous section, is the best candidate 
for applications of denoising. 

While the minimum sufficient statistic is a denoised 
representation of the original signal, it is not necessarily 
given in a directly usable form. For instance, y could 
consist of subsets of X, but a set of source-words is not 
always acceptable as a denoising result. So in general 
one may need to apply some function f : y ^ X lo 
the sufficient statistic to construct a usable object. But 
if X = y and the distortion function is a metric, as 
in our case, then the representations are already in an 
acceptable format, so here we use the identity function 
for the transformation /. 

In applications of lossy compression, one may be 
willing to accept a rate which is lower than the mini- 
mal sufficient statistic complexity, thereby losing some 
structural information. However, for a minimal sufficient 
statistic y, theory does tell us that it is not worthwhile 
to set the rate to a higher value than the complexity 

'This is superficially similar to the MDL function defined in [5], but 
it is not exactly the same since it involves optimisation of the distortion 
at a given rate rather than direct optimisation of the code length. 
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of y. The original object a; is a random element of 
Sy{d{x,y)), and it cannot be distinguished from any 
other random z G Sy{d{x,y)) using only "simply 
described" properties. So we have no "simply described" 
test to discredit the hypothesis that x (or any such z) is 
the original object, given y and d{x,y). But increasing 
the rate, yielding a model y' and d{x, y') < d{x, y), we 
commonly obtain a sphere Syi of smaller cardinality than 
Sy, with some random elements of Sy not being random 
elements of Syi. These excluded elements, however, 
were perfectly good candidates of being the original 
object. That is, at rate higher than that of the minimal 
sufficient statistic, the resulting representation y' models 
irrelevant features that are specific to x, that is, noise 
and no structure, that exclude viable candidates for the 
olriginal object: the representation starts to "overfit". 

In lossy compression, as in denoising, the represen- 
tations themselves may be unsuitable for presentation 
to the user. For example, when decompressing a lossily 
compressed image, in most applications a set of images 
would not be an acceptable result. So again a transfor- 
mation from representations to objects of a usable form 
has to be specified. There are two obvious ways of doing 
this: 

1) If a representation y witnesses the rate-distortion 
function for a source word x G X, then this means 
that X cannot be distinguished from any other 
object x' e Sy{d{x,y)) at rate K{y). Therefore 
we should not use a deterministic transforma- 
tion, but rather report the uniform distribution on 
Sy{d{x, y)) as the lossily compressed version of x. 
This method has the advantage that it is applicable 
whether or not X = y. 

2) On the other hand, if X = y and the distortion 
function is a metric, then it makes sense to use 
the identity transformation again, although here the 
motivation is different. Suppose we select some 
x' G Sy{d{x,y)) instead of y. Then the best 
upper bound we can give on the distortion is 
d{x,x') < d{x,y) + d{y,x') = 2d{x,y) (by the 
triangle inequality and symmetry). On the other 
hand if we select y, then the distortion is exactly 
d{x, y), which is only half of the upper bound we 
obtained for x'. Therefore it is more suitable if one 
adopts a worst-case approach. This method has as 
an additional advantage that the decoder does not 
need to know the distortion d{x, y) which often 
cannot be computed from y without knowledge of 

X. 

To illustrate the difference one may expect from these 
approaches, consider the situation where the rate is lower 



than the rate that would be required to specify a sufficient 
statistic. Then intuitively, all the noise in the source 
word X as well as some of the structure are lost by 
compressing it to a representation y. The second method 
immediately reports y, which contains a lot less noise 
than the source object x; thus x and y are qualitatively 
different, which may be undesirable. On the other hand, 
the compression result will be qualitatively different 
from X anyway, because the rate simply is too low to 
retain all structure. If one would apply the first approach, 
then a result x' would likely contain more noise than the 
original, because it contains less structure at the same 
level of distortion (meaning that K(x') > K(x) while 
d{x',y) = d{x,y)). 

If the rate is high enough to transmit a sufficient 
statistic, then the first approach seems preferable. We 
have nevertheless chosen to always report y directly in 
our analysis, which has the advantage that this way, all 
reported results are of the same type. 

111. Computing Individual Object 
Rate-Distortion 

The rate-distortion function for an object x with side 
information z and a distortion function d is found by 
simultaneous minimizing two objective functions 

ffi(2/) = K{y\z) (3) 
g2{y) = d{x,y) (4) 

9{y) = {9i{y),92{y)) ■ 

We call the tuple g{y) the trade-off of y. We impose a 
partial order on representations: 

y^y' (5) 

if and only if gi{y) < gi{y') and 52(2/) < 52 (y')- Our 
goal is to find the set of Pareto-optimal representations, 
that is, the set of representations that are minimal under 

Such an optimisation problem cannot be implemented 
because of the uncomputabihty of K{-). To make the 
idea practical, we need to approximate the conditional 
Kolmogorov complexity. As observed in [7], it follows 
directly from symmetry of information for Kolmogorov 
complexity (see [3, p. 233]) that: 

K{y\z) = K{zy)-K{z) + 0{logn), (6) 

where n is the length of zy. Ignoring the logarithmic 
term, this quantity can be approximated by replacing 
K{-) by the length of the compressed representation 
under a general purpose compression algorithm A: X ^ 
{0, 1}*. The length of the compressed representation of 
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X is denoted by La{x). This way we obtain, up to an 
additive independent constant: 

K{v\z)<LA{zy)-LA{z). (7) 

We redefine gi{y) :— LA{zy) — La{z) in order to get a 
practical objective function. 

This may be a poor approximation: we only have that 
< K{y\z) < LA{zy) — La{z), up to a constant, 
so the compressed size is an upper bound that may 
be quite high even for objects that have Kolmogorov 
complexity close to zero. Our results show evidence 
that some of the theoretical properties of the distortion- 
rate function nevertheless carry over to the practical 
setting; we also explain how some observations that 
are not predicted by theory are in fact related to the 
(unavoidable) inefficiencies of the used compressor 

A. Compressor ( rate function ) 

We could have used any general-purpose compressor 
in 0, but we chose to implement our own for three 
reasons: 

« It should be both fast and efficient. We can gain 
some advantage over other available compressors 
because there is no need to actually construct a 
code. It suffices to compute code lengths, which is 
much easier. As a secondary advantage, the code- 
lengths we compute are not necessarily multiples of 
eight bits: we allow rational idealised codelengths, 
which may improve precision. 
• It should not have any arbitrary restrictions or opti- 
misations. Most general purpose compressors have 
limited window sizes or optimisations to improve 
compression of common file types; such features 
could make the results harder to interpret. 
In our experiments we used a block sorting compression 
algorithm with a move-to-front scheme as descibed in 
[8]. In the encoding stage M2 we employ a simple 
statistical model and omit the actual encoding as it 
suffices to accumulate codelengths. The source code of 
our implementation (in C) is available from the authors 
upon request. The resulting algorithm is very similar 
to a number of common general purpose compressors, 
such the freely available bzip2[9] and zzip[10], but it is 
simpler and faster for small inputs. 

Of course, domain specific compressors might yield 
better compression for some object types (such as sound 
wave files), and therefore a better approximation of 
the Kolmogorov complexity. However, the compressor 
that we implemented is quite efficient for objects of 
many of the types that occur in practice; in particular 
it compressed the objects of our experiments (text and 



small images) quite well. We have tried to improve 
compression performance by applying standard image 
preprocessing algorithms to the images, but this turned 
out not to improve compression at all. Figure |2] lists the 
compressed size of an image of a mouse under various 
different compression and filtering regimes. Compared to 
other compressors, ours is quite efficient; this is probably 
because other compressors are optimised for larger files 
and because we avoid all overhead inherent in the 
encoding process. Most compressors have optimisations 
for text files which might explain why our compressor 
compares less favourably on the Oscar Wilde fragment. 



B. Codelength Function 

In Section lTl-Bl we introduced the codelength function 
Xx{r). Its definition makes use of 0, for which we have 
not yet provided a computable alternative. We use the 
following approximation: 

Ly{x) « Ly{x) = LA{y)+LD{d{x,y)\y)+Lu{x\y,d{x,y)), 

(8) 

where Ld is yet another code which is necessary to 
specify the radius of the distortion sphere around y in 
which X can be found. It is possible that this distortion 
is uniquely determined by y, for example if 3^ is the set 
of all finite subsets of X and list decoding distortion is 
used, as described in [5]. If d{x, y) is a function of y then 
Lu{d{x,y)\y) — 0. In other cases, the representations 
do not hold sufficient information to determine the 
distortion. This is typically the case when A" = 3^ as in 
the examples in this text. In that case we actually need 
to encode d{x, y) separately. It turns out that the number 
of bits that are required to specify the distortion are 
negligible in proportion to the total three part codelength. 
In the remainder of the paper we use for Ld ^ universal 
code on the integers similar to the one described in [3]; 
it has codelength Loid) = log(d+ 1) + O (log log d). 



C. Distortion Functions 

We use three common distortion functions which we 
describe below. All distortion functions used in this text 
are metrics, which have the property that X — y. 

a) Hamming distortion: Hamming distortion is 
perhaps the simplest distortion function that could be 
used. Let x and y be two objects of equal length n. The 
Hamming distortion d{x, y) is equal to the number of 
symbols in x that do not match those in the correspond- 
ing positions in y. 
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Fig. 2. Compressed sizes of three objects tliat we experiment upon. See Figure |5]for tlie mouse. Figure l6lfor the cross with added noise and 
Figure lTol for the corrupted Oscar Wilde fragment (the middle version). In the latter we give the compressed size conditional on a training text, 
like in the experiments. "A" is our own algorithm, described in Section IlII-AI For a description of the filters see [11]. 



b) Euclidean distortion: As before, let x = 
xi . . .Xn and y = j/i . . . y„ be two objects of 
equal length, but the symbols now have a numeri- 
cal interpretation. Euclidean distortion is d{x,y) — 

~ Vi)'^- the distance between x and y when 
they are interpreted as vectors in an n-dimensional 
Euclidean space. Note that this definition of Euclidean 
distortion differs from the one in [4]. 

c) Edit distortion: The edit distortion of two strings 
X and y, of possibly different lengths, is the minimum 
number of symbols that have to be deleted from, inserted 
into, or changed in x in order to obtain y (or vice 
versa)[12]. It is also known as Levenshtein distortion. 
It is a well-known measure that is often used in appli- 
cations that require approximate string matching. 

D. Searching for the Rate-Distortion Function 

The search problem that we propose to address has 
two properties that make it very hard. Firstly, the search 
space is enormous: for an object of n bits there are 2" 
candidate representations of the same size, and objects 
that are typically subjected to lossy compression are 
often millions or billions of bits long. Secondly, we want 
to avoid making too many assumptions about the two 
objective functions, so that we can later freely change 
the compression algorithm and the distortion function. 
Under such circumstances the two most obvious search 
methods are not practical: 

• An exhaustive search is infeasible for search spaces 
of such large size, unless more specific properties 
of the objective functions are used in the design 
of the algorithm. To investigate how far we could 
take such an approach, we have implemented an 
exhaustive algorithm under the requirement that, 
given a prefix of a representation y, we can compute 
reasonable lower bounds on the values of both 



objective functions gi and g2- This allows for rela- 
tively efficient enumeration of all representations of 
which the objective functions do not exceed specific 
maxima: it is never necessary to consider objects 
which have a prefix for which the lower bounds 
exceed the constraints, which allows for significant 
pruning. In this fashion we were able to find the 
rate-distortion function under Hamming distortion 
for objects of which the compressed size is about 
25 bits or less within a few hours on a desk-top 
computer. 

• A greedy search starts with a poor solution and 
iteratively makes modifications that constitute strict 
improvements. We found that this procedure tends 
to terminate quickly in some local optimum that is 
very bad globally. 

Since the structure of the search landscape is at present 
poorly understood and we do not want to make any 
unjustifiable assumptions, we use a genetic search algo- 
rithm which performs well enough that interesting results 
can be obtained. It is described in the Appendix □ 

IV. Experiments 

We have subjected four objects to our program. The 
following considerations have influenced our choice of 
objects: 

• Objects should not be too complex, allowing our 
program to find a good approximation of the 
distortion-rate curve. We found that the running 
time of the program seems to depend mostly on the 
complexity of the original object; a compressed size 
of 20,000 bits seemed to be about the maximum our 
program could handle within a reasonable amount 
of time, requiring a running time of the order of 
weeks on a desk-top computer 

• To check that our method really is general, objects 
should be quite different: they should come from 
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different object domains, for which different dis- 
tortion functions are appropriate, and they should 
contain structure at different levels of complexity. 
• Objects should contain primary structure and reg- 
ularities that are distinguishable and compressible 
by a block sorting compressor such as the one we 
use. Otherwise, we may no longer hope that the 
compressor implements a significant approximation 
of the Kolmogorov complexity. For instance, we 
would not expect our program to do well on a 
sequence of digits from the binary expansion of the 
number tt. 

With this in mind, we have selected the objects listed in 
Figure |3l 

In each experiment, as time progressed the program in 
Appendix b found less and less improvements per itera- 
tion, but the set of candidate solutions, called the pool in 
the Appendix, never stabilized completely. Therefore we 
interrupted each experiment when (a) after at least one 
night of computation, the pool did not improve a lot, and 
(b) for all intuitively good models y G 3^ that we could 
conceive of a priori, the algorithm had found an y' in the 
pool with y' ^ y according to (|5j- For example, in each 
denoising experiment, this test included the original, 
noiseless object. In the experiment on the mouse without 
added noise, we also included the images that can be 
obtained by reducing the number of grey levels in the 
original with an image manipulation program. Finally for 
the greyscale images we included a number of objects 
that can be obtained by subjecting the original object to 
JPEG2000 compression at various quality levels. 

The first experiment illustrates how algorithmic rate- 
distortion theory may be applied to lossy compression 
problems, and it illustrates how for a given rate, some 
features of the image are preserved while others can 
no longer be retained. We compare the performance of 
our method to the performance of JPEG and JPEG2000 
at various quality levels. Standard JPEG images were 
encoded using the ImageMagick version 6.2.2; profile in- 
formation was stripped. JPEG2000 images were encoded 
to jpc format with three quality levels using NetPBM 
version 10.33.0; all other options are default. For more 
information about these software packages refer to [13] 
and [14]. 

The other three experiments are concerned with de- 
noising. Any model that is output by the program can be 
interpreted as a denoised version of the input object. We 
measure the denoising success of a model y as d{x' , y), 
where x' is the original version of the input object x, 
before noise was added. We also compare the denoising 
results to those of other denoising algorithms: 



1) BayesShrink denoising[15]. BayesShrink is a pop- 
ular wavelet-based denoising method that is con- 
sidered to work well for images. 

2) Blurring (convolution with a Gaussian kernel). 
Blurring works like a low-pass filter, eliminating 
high frequency information such as noise. Unfortu- 
nately other high frequency features of the image, 
such as sharp contours, are also discarded. 

3) Naive denoising. We applied a naive denoising 
algorithm to the noisy cross, in which each pixel 
was inverted if five or more out of the eight 
neighbouring pixels were of different colour. 

4) Denoising based on JPEG2000. Here we subjected 
the noisy input image to JPEG2000 compression at 
different quality levels. We then selected the result 
for which the distortion to the original image was 
lowest. 

A. Names of Objects 

To facilitate description and discussion of the exper- 
iments we will adopt the following naming convention. 
Objects related to the experiments with the mouse, the 
noisy cross, the noisy mouse and the Wilde fragment, are 
denoted by the symbols M, C, N and W respectively. A 
number of important objects in each experiment are iden- 
tified by a subscript as follows. For O G {M, C, N, W}, 
the input object, for which the rate-distortion function 
is approximated by the program, is called Oin, which 
is sometimes abbreviated to O. In the denoising ex- 
periments, the input object is always constructed by 
adding noise to an original object. The original objects 
and the noise are called Oqrig and Onoise respectively. 
If Hamming distortion is used, addition is carried out 
modulo 2, so that the input object is in effect a pixelwise 
exclusive OR of the original and the noise. In particular, 
CiN equals Cqrig XOR Cnoise- The program outputs the 
reduction of the gene pool, which is the set of considered 
models. Two important models are also given special 
names: the model within the gene pool that minimizes 
the distortion to Oqrig constitutes the best denoising of 
the input object and is therefore called Obest, and the 
minimal sufficient statistic as described in Section Hl-BI 
is called Omss- Finally, in the denoising experiments we 
also give names to the results of the alternative denoising 
algorithms. Namely, Cnaive is the result of the naive 
denoising algorithm applied to the noisy cross, Nblur is 
the convolution of N with a Gaussian kernel with a = 
0.458, Nbs is the denoising result of the BayesShrink 
algorithm, and Njpeg2()00 is the image produced by 
subjecting N to JPEG2000 compression at the quality 
level for which the distortion to Nqrig is minimized. 
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A picture of a mouse of 64 X 40 pixels. 
The picture is analyzed witli respect to 
Euclidean distortion. 



The same picture of a mouse, but now 
zero mean Gaussian noise with a = 8 
has been added to each pixel. Euclidean 
distortion is used; the distortion to the 
original mouse is 391.1. 




Beauty, real beauty, en(Js2wheresan 
intelleettial expressoon begins. IntellHct 
isg in itself a mMtie ofSexggeration, an\ 
destroys theLhamiony of n face. [. . . ] 

(See Fisure Uoi 



A noisy monochrome image of 64 X 64 
pixels that depicts a cross. 377 pixels 
have been inverted. Hamming distortion 
is used. 



A corrupted quotation from Chapter 1 
of The Picture of Dorian Gray, by Os- 
car Wilde. The 733 byte long fragment 
was created by performing 68 random 
insertions, deletions and replacements 
of characters in the oiiginal text. Edit 
distortion is used. The rest of chapters 
one and two of the novel are given to 
the program as side information. 



Fig. 3. The four objects that are subjected to rate-distortion analysis. 



V. Results and Discussion 

We will occasionally use terminology from AppendiXn 
but such references can safely be glossed over on first 
reading. After running for some time on each input 
object, our program outputs the reduction of a pool 
V, which is interpreted as a set of models. For each 
experiment, we report a number of different properties 
of these sets. Since we are interested in the rate-distortion 
properties of the input object x = Oin, we plot the 
approximation of the distortion-rate function of each 
input object: dx{r) — mm{d{x,y) : y G y,K{y) < 
r} « mm{d{x,y) : y G trd{V), LA{y) < r}, where La 
denotes the codelength for an object under our compres- 
sion algorithm. Such approximations of the distortion- 
rate function are provided for all four experiments. For 
the greyscale images we also plot the distortion-rate 
approximation that is achieved by JPEG2000 (and in 
Figure |5l also ordinary JPEG) at different quality levels. 
Here, the rate is the codelength achieved by JPEG(2000), 
and the distortion is the Euclidean distortion to Oin- We 
also plot the codelength function as discussed in Sec- 
tion H^B] Minimal sufficient statistics can be identified 
by locating the minimum of this graph. 

A. Lossy Compression 

Experiment 1: Wlouse (Euclidean distortion): Our first 
experiment involved the lossy compression of M, a 
greyscale image of a mouse. A number of elements of 
the gene pool are shown in Figure |4] The pictures show 
how at low rates, the models capture the most important 
global structures of the image; at higher rates more subtle 
properties of the image can be represented. Figure |4(a)| 
shows a rough rendering of the distribution of bright and 
dark areas in Min. These shapes are rectangular, which 



is probably an artifact of the compression algorithm 
we used: it is better able to compress images with 
rectangular structure than with circular structure. There 
is no real reason why a circular structure should be 
in any way more complex than a rectangular structure, 
but most general purpose data compression software is 
similarly biased. In |4(b)| the rate is high enough that 
the oval shape of the mouse can be accommodated, and 
two areas of different overall brightness are identified. 
After the number of grey shades has been increased a 
little further in |4(c)| the first hint of the mouse's eyes 
becomes visible. The eyes are improved and the mouse 
is given paws in |4(d)| At higher rates, the image becomes 
more and more refined, but the improvements are subtle 
and seem of a less qualitative nature. 

Figure |5(b)| shows that the only sufficient statistic in 
the set of models is Min itself, indicating that the image 
hardly contains any noise. It also shows the rates that 
correspond to the models that are shown in Figure 0] 
By comparing these figures it can be clearly seen that 
the image quality only starts to deteriorate significantly 
after more than half of the information in Min has been 
discarded. Note that this is not a statement about the 
compression ratio, where the size is related to the size 
of the uncompressed object. For example, Min has an 
uncompressed size of 64 • 40 • 8 = 20480 bits, and 
the representation in Figure |4(g)| has a compressed size 
of 3190.6 bits. This representation therefore constitutes 
compression by a factor of 20480/3190.6 = 6.42, which 
is substantial for an image of such small size. At the 
same time the amount of information is reduced by a 
factor of 7995.0/3190.6 = 2.51. 
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(a) r = 163.0, d = 2210.0 



(b) r = 437.8, d=1080.5 





(a) Approximate distortion-rate function 




Fig. 5. Results for the Mouse fFigureRl. 
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B. Denoising 

For each denoising experiment, we report a number 
of important objects, a graph that shows the approxi- 
mate distortion-rate function and a graph that shows the 
approximate codelength function. In the distortion-rate 
graph we plot not only the distortion to Ojn but also the 
distortion to Oorio, to visualise the denoising success at 
each rate. 

In interpreting these results, it is important to realise 
that only the reported minimal sufficient statistic and the 
results of the BayesShrink and naive denoising methods 
can be obtained without knowledge of the original object 
- the other objects Obest, Ojpeg2000 ™d Oblur require 
selecting between a number of alternatives in order to 
optimise the distortion to Oorig, which can only be done 
in a controlled experiment. Their performance may be 
better than what can be achieved in practical situations 
where Oorig is not known. 

Experiment 2: Noisy Cross (Hamming distortion): 
In the first denoising experiment we approximated the 
distortion-rate function of a monochrome cross Cqrio of 
very low complexity, to which artificial noise was added 
to obtain Cin. Figure |6l shows the result; the distortion 
to the noiseless cross is displayed in the same graph. 
The best denoising Cbest has a distortion of only 3 to 
the original Cqrig, which shows that the distortion-rate 
function indeed separates structure and noise extremely 
well in this example. Figure |6(b)| shows the codelength 
function for the noisy cross; the minimum on this graph 
is the minimal sufficient statistic Cmss- In this low 
complexity example, we have Cmss = Cbest, so the best 
denoising is not only very good in this simple example, 
but it can also be identified. 

We did not subject C to BayesShrink or blurring 
because those methods are unsuitable to monochrome 
images. Therefore we used the extremely simple, "naive" 
denoising method that is described in Section IIVI on 
this specific image instead. The result is shown in 
Figure |6(c)| while it does remove most of the noise, 40 
eiTors remain which is a lot more than those incuiTed by 
the minimal sufficient statistic. Figure|57H^shows that all 
errors except one are close to the contours of the cross. 
This illustrates how the naive algorithm is limited by 
its property that it takes only the local neighbourhood 
of each pixel into account, it cannot represent larger 
structures such as straight lines. 

Experiment 3: "Hoisy mouse (Euclidean distortion): 
The noisy mouse poses a significantly harder denoising 
problem, where the total complexity of the input Nin is 
more than five times as high as for the noisy cross. The 
graphs that show the approximation to the distortion-rate 



function and the codelength function are in Figure |8] 
below we discuss the approximations that are indicated 
on the graphs with references to Figure 

Figure |7(b)| shows the input object Nin; it was con- 
structed by adding noise to the original noiseless image 
NoRiG=MiN. We have displayed the denoising results 
which were obtained in three different ways. In Fig- 
ure ^(c)] we have shown Nbest, the best denoised object 
from the gene pool. Visually it appears to resemble 
NoRiG quite well, but it might be the case that there is 
structure in Nqrig that is lost in the denoising process. 
Because human perception is perhaps the most sensitive 
detector of structure in image data, we have shown the 
difference between Nbest and Nqrig in Figure |7(d)| We 
would expect any significant structure in the original 
image that is lost in the denoising process, as well as 
structure that is not present in the original image, but 
is somehow introduced as an artifact of the denoising 
procedure, to become visible in this residual image. In 
the case of Nbest we cannot make out any particular 
features in the residual. 

We have done the same for the minimal sufficient 
statistic (Figure |7(e)t . The result also appears to be a 
quite successful denoising, although it is clearly of lower 
complexity than the best one. This is also visible in the 
residual, which still does not appear to contain much 
structure, but darker and lighter patches are definitely 
discernible. Apparently the difference between Nin and 
Nmss does contain some structure, but is nevertheless 
coded more efficiently using a literal description than 
using the given compression algorithm. We think that 
the fact that the minimal sufficient statistic is of lower 
complexity than the best possible denoising result should 
therefore again be attributed to inefficiencies of the 
compressor. 

For comparison, we have also denoised N using the 
alternative denoising method BayesShrink and the meth- 
ods based on blurring and JPEG2000 as described in 
Section IIVI We found that BayesShrink does not work 
well for images of such small size: the distortion between 
Nbs and Nin is only 72.9, which means that the input 
image is hardly effected at all. Also, Nbs has a distortion 
of 383.8 to Nqrigj which is hardly less than the distortion 
of 392.1 achieved by Nin itself. 

(Continued after Figures i^llsl l 
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(a) Approximate distortion-rate function and denoising success function. Marked are Cbest (Irft) and Cjn (riglit). We liave La(C 
3178.6, rf(CoRiG,CiN) = 377, La (Cbest) = 260.4 and d(CoRiG, Chest) = 3. 



4000 p 
3500 - 




500 1000 1500 2000 2500 3000 

rate (compressed size in bits) 

(b) Approximate codelength function. Cmss is marked. In tliis case we have Cmss = Cbest- 
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(c) Cnaive- La (Cnaive) = 669.2, (d) Cnaive - Cqrig 

c^CCoRiG. Cnaive) = 40, (i{CiN, Cnaive) = 389. 

Results for tlie noisy Cross (top riglit inset). 





(e) Nmss; 
r = 1969.S 



j(a)l= 337.0, j(b)l= 474.7 




(f) N^ 
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(a) Approximate distortion-rate graph for tlie Noisy Mouse. Nbest> ^jpeg2000 ^'N marked. 
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(b) Approximate codelength function. The minimal sufficient statistic Nmss is marked. 



Fig. 8. Results for the Noisy Mouse 
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Blurring-based denoising yields much better results: 
Nblur is the result after optimisation of the size of the 
Gaussian kernel. Its distortion to Nqrig lies in-between 
the distortions achieved by Nmss and Nbest, but it is 
different from those objects in two important respects. 
Firstly, Nblur remains much closer to Nin, at a distortion 
of 260.4 instead of more than 470, and secondly, Nblur 
is much less compressible by La- (To obtain the reported 
size of 14117 bits we had to switch on the averaging 
filter, as described in Section ITlI- Ah These observations 
are at present not well understood. Figure |7(h)| shows 
that the contours of the mouse are somewhat distorted 
in Nblur; this can be explained by the fact that contours 
contain high frequency information which is discarded 
by the blurring operation as we remarked in Section Hvl 

The last denoising method we compared our results 
to is the one based on the JPEG2000 algorithm. Its 
performance is clearly inferior to our method visually 
as well as in terms of rate and distortion. The result 
seems to have undergone a smoothing process similar to 
blurring which introduces similar artifacts in the back- 
ground noise, as is clearly visible in the residual image. 
As before, the comparison may be somewhat unfair 
because JPEG2000 was not designed for the purpose of 
denoising, might optimise a different distortion measure 
and is much faster 

Experiment 4: Oscar Wilde fragment ( edit distortion ): 
The fourth experiment, in which we analyze Win, a 
corrupted quotation from Oscar Wilde, shows that our 
method is a general approach to denoising that does not 
require many domain specific assumptions. Worig, Win 
and Wmss are depicted in Figure^! the distortion-rate 
approximation, the distortion to Wqrig and the three 
part codelength function are shown in Figure ^2 We 
have trained the compression algorithm by supplying 
it with the rest of Chapters 1 and 2 of the same 
novel as side information, to make it more efficient at 
compressing fragments of English text. We make the 
following observations regarding the minimal sufficient 
statistic; 

• In this experiment, Wmss = Wbest so the minimal 
sufficient statistic separates structure from noise 
extremely well here. 

• The distortion is reduced from 68 errors to only 46 
errors. 26 errors are corrected (A), 4 are introduced 
(T), 20 are unchanged (•) and 22 are changed 
incorrectly (if). 

m The errors that are newly introduced (T) and 
the incorrect changes (if) typically simplify the 
fragment a lot, so that the compressed size may 
be expected to drop significantly. Not surprisingly 



therefore, many of the symbols marked T or if 
are deletions, or modifications that create a word 
which is different from the original, but still correct 
English. The following table Usts examples of the 
last category: 



Line 


WoRIG 


Win 


Wmss 


3 


or 


Nor 


of 


3 


the 


Ghe 


he 


4 


any 


anL 


an 


4 


learned 


JeaFned 


yearned 


4 


course 


cor ze 


core 


5 


then 


ehen 


when 


7 


he 


the 


the 



Since it would be hard for any general-purpose 
mechanical method (that does not incorporate a 
specialized full linguistic model of English) to de- 
termine that these changes are incorrect, we should 
not be surprised to find a number of errors of this 
kind. 

Side Information: Figure |9l shows that the compres- 
sion performance is significantly improved if we provide 
side information to the compression algorithm, and the 
improvement is typically larger if ( 1 ) the amount of side 
information is larger, or (2) if the compressed object is 
more similar to the side information. Thus, by giving 
side information, correct English prose is recognised 
as "structure" sooner and a better separation between 
structure and noise is to be expected. The table also 
shows that if the compressed object is in some way 
different from the side information, then adding more 
side information will at some point become counter- 
productive, presumably because the compression algo- 
rithm will then use the side information to build up false 
expectations about the object to be compressed, which 
can be costly. 

While denoising performance probably increases if 
the amount of side information is increased, it was 
infeasible to do so in this implementation. Recall from 
Section Ell] that he conditional Kolmogorov complexity 
K{y\z) is approximated by LA{zy) — La{z). The time 
required to compute this is dominated by the length 
of z if the amount of side information is much larger 
than the size of the object to be compressed. This could 
be remedied by using a compression algorithm A' that 
operates sequentially from left to right, because the state 
of such an algorithm can be cached after processing the 
side information z; computing LA'{zy) would then be 
a simple matter of recalling the state that was reached 
after processing z and then processing y starting from 
that state. Many compression algorithms, among which 
Lempel-Ziv compressors and most statistical compres- 
sors, have this property; our approach could thus be 
made to work with large quantities of side information 
by switching to a sequential compressor but we have not 
done this. 
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Side information £a(Worig) La(Wmss) £a(Win) 



None 3344.1 3333.7 3834.8 

Chapters 1,2 (57 kB) 1745.7 1901.9 3234.5 

Whole novel (421 kB) 1513.6 1876.5 3365.9 

Fig. 9. Compressed size of models for different amounts of side information. Wq riq is never included in the side information. We do not let 
Wmss vary with side information but keep it fixed at the object reported in Figure [TO(c)| 



Beauty, real beauty, ends where an intellectual expression begins. Intellect is in itself a 
mode of exaggeration, and destroys the harmony of any face. The moment one sits down 
to think, one becomes all nose, or all forehead, or something horrid. Look at the successful 
men in any of the learned professions. How perfectly hideous they are! Except, of course, in 
the Church. But then in the Church they don't think. A bishop keeps on saying at the age 
of eighty what he was told to say when he was a boy of eighteen, and as a natural consequence he always looks 
absolutely delightful. Your mysterious young friend, whose name you have never told me, but whose picture really 
fascinates me, never thinks. I feel quite sure of that. 

(a) WoRiG, the original text 



Beauty, real beauty, ends2wheresan intellectual expressoon begins. IntellHct isg in itself a 
mMde ofSexggeration, an\ destroys theLharmony of n face. :The mlment one sits down 
to ahink@ one becomes jll noe" Nor all forehbead, or something hNrrid. Look a Ghe successf\l 
men in anL of te JeaFned professions. How perjectly tideous 4they re6 Except, of corze, in7 
the Ch4rch. BuP ehen in the Church they dol't bthink. =A bishop keeps on saying at the age 
of eighty what he was told to say wh"n he was aJb4y of eighten, and sja natural cnsequence 
fhe a(ways looks ab8olstely de[ightfu). Your mysterious youngL friend, wPose name you h\vo never tld 
me, mut whose picture really fa?scinates Lme,Pnever thinCs. 1 feel quite surS of that9 

(b) Win, the corrupted version of the fragment. At 68 randomly selected positions characters have been inserted, deleted or modified. 
New and replacement characters are drawn uniformly from ASCII symbols 32-126. 

■k i • 11 

Beauty, real beauty, ends-where an intellectual expressoon begins. Intellect isQ in itself a 

11* 1 1 -A-l 

mode of exaggeration, and destroys the harmony of bub face. aThe moment one sits down 

1* ★•★IT ★ •★ ★ 

to thinkQ one becomes dU noHem Dof all forebead, or something hirrid. Look as Qhe successful 

★ •★1 T 1* 1«* <★ 

men in ans of tae yearned provessions. How perfectly tideous mthey □re6 Except, of coHrae, in 

1 ★I* •l+T 

□ the Charch. But when in the Church they dol't athink. HADbishop keeps on saying at the age 

★ ★★ • • ★ 

of eighty what he was told to say whan he was aabsy of eightaen, and asja natural CQnsequence the 

★ 11*1 1 1 ii«« 

aaways looks absolutely deaightful. Your mysterious youngH friend, whose name you have never tald me, mut 

111 • ★ • 

whose picture really faascinates ame, never thinCs. I feel quite surs of that9 

(c) WgEST = WmsSi it edit distortion 46 to the original fragment. Marks indicate the en'or type: A=coiTection; T=new error; 
•=old eiTor; ★=changed but still wrong. Deletions are represented as □. 

Fig. 10. A fragment of The Picture of Dorian Gray, by Oscar Wilde. 
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(a) Approximate distortion-rate function, Wbest ™d Win ^6 marked. 




(b) Approximate codelengtli function, Wmss is marked 



Fig. 11. Results for a fragment of The Picture of Dorian Gray by Os^ Wilde (also see Figure [Tol . 



VI. Quality of the Approximation 

It is easy to see from its definition that the distortion- 
rate function must be a non-increasing function of the 
rate. The implementation guarantees that our approxi- 
mation is also non-increasing. In [4] it is assumed that 
for every x ^ X there exists a representation y G 3^ 
such that d{x, y) — 0; in the context of this paper this is 
certainly true because we have X — y and a distortion 
function which is a metric. The gene pool is initialised 
with OiN, which always has zero weakness and must 
therefore remain in the pool. Therefore at a rate that 
is high enough to specify x, the distortion-rate function 
reaches zero. 

The shape of the codelength function for an object x is 
more complicated. Let y be the representation for which 
d{x, y) = 0. In theory, the codelength can never become 
less than the complexity of y, and the minimal sufficient 
statistic witnesses the codelength function at the lowest 
rate at which the codelength is equal to the complexity of 
y. Practically, we found in all denoising experiments that 
the total codelength using the minimal sufficient statistic, 
Lq^^^(Oin), is lower than the codelength L^(Oin) that 
is obtained by compressing the input object directly. This 
can be observed in Figures |6(b)| |8(b)| and |10(c)| The 
effect is most clearly visible in |6(b)| where the separation 
between structure and noise is most pronounced. 

Our hypothesis is that this departure from the theoret- 
ical shape of the codelength function must be explained 
by inefficiency of the compression algorithm in dealing 
with noise. This is evidenced by the fact that it needs 
2735.7 bits to encode Cnoise, while only log^ {^°^^) ~ 
1810 bits would suffice if the noise were specified with 
a uniform code on the set of indices of all binary 
sequences with exactly 377 ones out of 64 • 64 (see 
Appendix O. Similarly, ^^(Nnoise) — 14093, whereas 
a literal encoding requires at most 12829 bits (using the 
bound in Appendix IdI. 

Another strange effect occurs in Figure |8(b)| where 
the codelength function displays a strange "bump": as 
the rate is increased beyond the level required to specify 
the minimal sufficient statistic, the codelength goes up as 
before, but here at very high rates the codelength starts 
dropping again. 

It is theoretically possible that the codelength function 
should exhibit such behaviour to a limited extent. It can 
be seen in [4] that a temporary increase in the codelength 
function can occur up to a number of bits that depends on 
the so-called covering coefficient. Loosely speaking this 
is the density of small distortion balls that is required 
in order to completely cover a larger distortion ball. 
The covering coefficient in turn depends on the used 



distortion function and the number of dimensions. It is 
quite hard to analyze in the case of Euclidean distortion, 
so we cannot at present say if theory admits such a large 
increase in the codelength function. However, we believe 
that the explanation is more mundane in this case. Since 
the noisy mouse is the most complex object of the four 
we experimented on, we fear that this bump may simply 
indicate that we interrupted our search procedure too 
soon. Quite possibly, after a few years of running the 
codelength function would have run straight in between 
Nmss and Nin. 

Figure |5(a)| shows that our approximation of the 
distortion-rate function is somewhat better than the ap- 
proximation provided by either JPEG or JPEG2000, 
although the difference is not extremely large for higher 
rates. The probable reason is twofold: on the one 
hand, we do not know for which distortion function 
JPEG(2000) is optimised, but it might well be something 
other than Euclidean distortion. If this is the case, then 
our comparison is unfair because our method might well 
perform worse on JPEG(2000)'s own distortion measure. 
On the other hand, JPEG(2000) is very time-efficient: 
It took only a matter of seconds to compute models at 
various different quality levels, while it took our own 
algorithm days or weeks to compute its distortion-rate 
approximation. Two conclusions can be drawn from our 
result. Namely, if the performance of existing image 
compression software had been better than the perfor- 
mance of our own method in our experiments, this 
would have been evidence to suggest that our algorithm 
does not compute a good approximation to the rate- 
distortion function. The fact that this is not the case is 
thus reassuring. Vice versa, if we assume that we have 
computed a good approximation to the algorithmic rate- 
distortion function, then our results give a measure of 
how close JPEG(2000) comes to the theoretical opti- 
mum; our program can thus be used to provide a basis for 
the evaluation of the performance of lossy compressors. 

VII. Conclusion 

Algorithmic rate-distortion provides a good frame- 
work for analysis of large and structured objects. It is 
based on Kolmogorov complexity, which is not com- 
putable. We nevertheless attempted to put this theory 
into practice by approximating the Kolmogorov com- 
plexity of an object by its compressed size. We also 
generalized the theory in order to enable it to cope with 
side information, which is interpreted as being available 
to both the sender and the receiver in a transmission 
over a rate restricted channel. We also describe how 
algorithmic rate-distortion theory may be applied to lossy 
compression and denoising problems. 
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Finding the approximate rate-distortion function of 
an individual object is a difficult search problem. We 
describe a genetic algorithm that is very slow, but 
has the important advantage that it requires only few 
assumptions about the problem at hand. Judging from 
our experimental results, our algorithm provides a good 
approximation, as long as its input object is of reasonably 
low complexity and is compressible by the used data 
compressor. The shape of the approximate rate-distortion 
function, and especially that of the associated three part 
codelength function, is reasonably similar to the shape 
that we would expect on the basis of theory, but there 
is a striking difference as well: at rates higher than the 
complexity of the minimal sufficient statistic, the three 
part codelength tends to increase with the rate, where 
theory suggests it should remain constant. We expect 
that this effect can be attributed to inefficiencies in the 
compressor. 

We find that the algorithm performs quite well in lossy 
compression, with apparently somewhat better image 
quality than that achieved by JPEG2000, although the 
comparison may not be altogether fair. When applied 
to denoising, the minimal sufficient statistic tends to 
be a slight underestimate of the complexity of the best 
possible denoising (an example of underfitting). This 
is presumably again due to inefficiencies in the used 
compression algorithm. 
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Appendix 

We use an almost completely generic procedure to 
simultaneously optimise two separate objective functions 
for objects that are represented as byte sequences. To 
emphasize this we will consider the abstract objective 
function g wherever possible, rather than the more con- 
crete rate and distortion functions. 



A. Definitions 

We need a number of definitions to facilitate the 
discussion of the algorithm and its results below. A finite 
subset of y is called a pool. A pool V induces a tradeoff 
profile p{P) := {g{y) ■ y' =4 y for some y' in P}. 
The weakness w-p(y) of an object y € V is defined 
as the number of elements of the pool that are smaller 
according to The (transitive) reduction trd(7-') of a 
pool V is the subset of all elements with zero weakness. 
The elements of the reduction of a pool V are called 
models. 



B. Genetic Algorithm 

The search algorithm initialises a pool Vo, which 
is then subjected to a process of selection through 
survival of the fittest: the pool is iteratively updated by 
replacing elements with low fitness (the fitness function 
is specified below) by new ones, which are created 
through either mutation (random modifications of ele- 
ments) or crossover ("genetic" recombination of pairs 
of other candidates). We write Vi to denote the pool 
after i iterations. When the algorithm terminates after n 
iterations it outputs the reduction of Vn- 

In what follows we will describe our choices for the 
important components of the algorithm: the mechanics 
of crossover and mutation, the fitness function and the 
selection function which specifies the probability that a 
candidate is removed from the pool. In the interest of 
reproducibility we faithfully describe all our important 
design choices, even though some of them are somewhat 
arbitrary. 

1) Crossover: Crossover (also called recombination) 
is effected by the following algorithm. Given two objects 
X and y we first split them both in three parts: x = 
X1X2X3 and y = 2/12/22/3, such that the length of xi is 
chosen uniformly at random between and the length 
of X and the length of X2 is chosen from a geometric 
distribution with mean 5; the lengths of the yi are 
proportional to the lengths of the Xi. We then construct 
a new object by concatenating xiy2Xs. 



2) Mutation: The introduction of a mutation oper- 
ation is necessary to ensure that the search space is 
connected, since the closure of the gene pool under 
crossover alone might not cover the entire search space. 
While we could have used any generic mutation function 
that meets this requirement, for reasons of efficiency 
we have decided to design a different mutation function 
for every objective function that we implemented. This 
is helpful because some distortion functions (here, the 
edit distortion) can compare objects of different sizes 
while others cannot: mutation is the means by which 
introduction of objects of different size to the pool 
can be brought about when desirable, or avoided when 
undesirable. 

The mutation algorithm we use can make two kinds of 
change. With probability 1/4 we make a small random 
modification using an algorithm that depends on the 
distortion function. Below is a table of the distortion 
functions and a short description of the associated mu- 
tation algorithms: 



Distortion 


Mutation algoritlim 


Hamming 
Euclidean 
Edit 


Sets a random byte to a uniformly random value 
Adds an A/'[0; cr = 10] value to a random byte 
A random byte is changed, inserted or deleted 



With probability 3/4 we use the following mutation 
algorithm instead. It splits the object x into three parts 
X = X1X2X3 where the length of xi is chosen uniformly 
at random between and the length of x and the 
length of X2 is chosen from a geometric distribution 
with mean 5. The mutation is effected by training a 
(simplified version of) a third order PPM model [16] on 
xi and then replacing X2 with an equally long sequence 
that is sampled from the model. The advantage of this 
scheme is that every replacement for X2 gets positive 
probability, but replacements which have low codelength 
and distortion tend to be much more likely than under a 
uniform distribution. 

3) Fitness Function: In theory, the set of representa- 
tions that witness the rate-distortion function does not 
change under monotonic transformation of either objec- 
tive function gi or (72. We have tried to maintain this 
property throughout the search algorithm including the 
fitness function, by never using either objective function 
directly but only the ordering relation Under such a 
regime, a very natural definition of fitness is minus the 
weakness of the objects with respect to pool P. 

It has been an interesting mini-puzzle to come up with 
an efficient algorithm to compute the weakness of all ob- 
jects in the pool efficiently. Our solution is the following 
very simple algorithm, which has an 0(n log n) average 
case rurming time. It first sorts all elements of the pool by 
their value under gi and then inserts them in order into 
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a binary search tree in which the elements are ordered 
by their value under 52- As an object is inserted into the 
tree, we can efficiently count how many elements with 
lower values for 32 the tree already contained. These 
elements are precisely the objects that have both lower 
values on gi (otherwise they would not appear in the 
tree yet) and on 52; as such their number is the desired 
weakness. 

4) Selection Function: It is not hard to see that 
we have piV) = p(trd(P)) and p{r) C p{r U V). 
Therefore monotonic improvement of the pool under 
modification is ensured as long as candidates with weak- 
ness are never dropped. 

We drop other candidates with positive probability 
as follows. Let yi, . . . ,?/„ be the elements of V with 
nonzero weakness, ordered such that for 1 < i < j < n 
we have w-piVi) < w-p{yj) or gi{y,) < gi{y^) if y, and 
yj have the same weakness. We drop candidate yi from 
the pool with probabihty 1/(1 + (j^yj^ ^ 1)")' which is 
a modified sigmoid function where a G (1, 00) specifies 
the sharpness of the transition from probability zero to 
one. This function is plotted for different values of a in 
Figure [21 We used q = 4 in our experiments. 




0.4 0.6 

(i-1/2)/n 

Fig. 12. Drop probability as a function of the index 

For each distortion function, we compute or bound 
from above the size of a distortion sphere around some 
representation y G y. A sphere with centre y of radius 
a is a set of the form Sy{a) := {x £ X : d{x, y) = a}. 

Distortion spheres are important in the analysis be- 
cause they enable us to link the distortion to the amount 
of discarded information. 



C. The Size of a Hamming Distortion Sphere 

Hamming distortion can be defined when X = y = 
S", sequences of n symbols from a finite alphabet S. 



Let y G 3^ be an object of length n; there is a bijection 
between the elements of Sy{a) and all possible ways to 
replace a symbols in y with different symbols from the 
alphabet. Thus the size of the sphere can be calculated 

as C)(|S|-ir. 

D. The Size of a Euclidean Distortion Sphere 

Our variety of Euclidean distortion requires that X = 
y = Z", the set of n-dimensional vectors of integers. 
The size of a Euclidean distortion sphere around some 
y E y of length n is hard to compute analytically. We 
use an upper bound that is reasonably tight and can be 
computed efficiently. First we define d{v) := d{v, 0) = 
vf and S{n,a) as the set {v : \v\ = n,d{v) = 
a}. We have x E Sy{a) 4^ x — y € S{n,a), so it suffices 
to bound the size of S{n,a). We define: 



(5=255 



p{S\n,a) := 



2/2a' 



where c = 1 / 



2/20'^ 



(5=-255 



1=1 

p{-\n,d{v)) can be interpreted as a probability mass 
function on the individual entries of v (which in our 
application always lie between -255 and 255). There- 
fore P{v) defines a valid probability mass function on 
outcomes v in S{n,a). Thus, 



vGS{n,a) vGS{n,a) 



E 



c e 



'71 /2 



c e 



-n/2 



S{n, a)\ 



This yields a bound on the size of S{n, a), which is 
reasonably tight unless the distortion a is very low. In 
that case, we can improve the bound by observing that 
V must have at least z = n — d{v)'^ zero entries. Let 
v' be a vector of length n ~ z that is obtained by 
removing z zero entries from v. Every v in S{n, a) 
can be constructed by inserting z zeroes into v', so we 
have |>5'j,(a)| = \S{n,a)\ < (") |5(n — z, a)|. The size 
of S{n — z,a) can be bounded by using the method 
described before recursively. 

E. The Size of an Edit Distortion Sphere 

Edit distortion can be defined for spaces X = y = T,* 
for a finite alphabet E. We develop an upper bound on 
the size of the edit distortion sphere. We can identify 
any object in Sy{a) by a program p that operates on 
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y, and which is defined by a list of instructions to 
copy, replace or delete the next symbol from y, or 
to insert a new symbol. We interpret a deletion as a 
replacement with an empty symbol; so the replacement 
operations henceforth include deletions. Let d(p) denote 
the number of insertions and replacements in p. Clearly 
for all a; e Sy{a), there must be a p such that p{y) = 
X and d{p) = d{x,y) = a. Therefore the size of 
Sy{a) can be upper bounded by counting the number of 
programs with d{p) = a. Let n be the length of y. Any 
program that contains i insertions and that processes y 
completely, must be of length n+i. The i insertions, a—i 
replacements and n — a + i copies can be distributed over 
the program in different ways. For each 

insertion and replacement, the number of possibilities is 
equal to the alphabet size. Therefore, 



The sphere can be extremely large, so to facilitate 
calculation of the log of the sphere size, as is required in 
our application, it is convenient to relax the bound some 
more and replace every term in the sum by the largest 
one. Calculation reveals that the largest term has 



\Syia)\ < \{p:d{p)=a}\ 
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